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Abstract 

Most current sampling algorithms for high-dimensional distribu- 
tions are based on MCMC techniques and arc approximate in the 
sense that they are valid only asymptotically. Rejection sampling, 
on the other hand, produces valid samples, but is unrealistically slow 
in high-dimension spaces. The OS* algorithm that we propose is a 
unified approach to exact optimization and sampling, based on incre- 
mental refinements of a functional upper bound, which combines ideas 
of adaptive rejection sampling and of A* optimization search. We show 
that the choice of the refinement can be done in a way that ensures 
tractability in high-dimension spaces, and we present first experiments 
in two different settings: inference in high-order HMMs and in large 
discrete graphical models. 



*Work conducted during an internsphip at XRCE. 



1 Introduction 



Common algorithms for sampling high-dimensional distributions are based 
on MCMC techniques [Andrieu et am2003| [Robert and Casella||2004| , which 



are approximate in the sense that they produce valid samples only asymptot- 



ically. By contrast, the elementary technique of Rejection Sampling [Robert 



and Casella 2004 directly produces exact samples, but, if applied naively 
to high-dimensional spaces, typically requires unacceptable time before pro- 
ducing a first sample. 

The algorithm that we propose, OS*, is a joint exact Optimization and 
Sampling algorithm that is inspired both by rejection sampling and by clas- 
sical A* optimization, and which can be applied to high-dimensional spaces. 
The main idea is to upper-bound the complex target distribution p by a sim- 
pler proposal distribution q, such that a dynamic programming (or another 
low-complexity) method can be applied to q in order to efficiently sample or 
maximize from it. In the case of sampling, rejection sampling is then applied 
to q, and on a reject at point x, q is refined into a slightly more complex q' 
in an adaptive way. This is done by using the evidence of the reject at x, 
implying a gap between q{x) andp(x), to identify a (soft) constraint implicit 
in p which is not accounted for by q, and by integrating this constraint in q 
to obtain q'. 

The constraint which is integrated tends to be highly relevant and to 
increase the acceptance rate of the algorithm. By contrast, many constraints 
that are constitutive of p are never "activated" by sampling from q, because 
q never explores regions where they would become visible. For example, 



anticipating on our HMM experiments in section 3.1 there is little point 
in explicitly including in g a 5-gram constraint on a certain latent sequence 
in the HMM if this sequence is already unlikely at the bigram level: the 
bigram constraints present in the proposal q will ensure that this sequence 
will never (or very rarely) be explored by q. 

The case of optimization is treated in exactly the same way as sampling. 
Formally, this consists in moving from assessing proposals in terms of the Li 
norm to assessing them in terms of the L^o norm. Typically, when a dynamic 
programming procedure is available for sampling (Li norm) with q, it is also 
available for maximizing from q (Lqo norm), and the main difference between 
the two cases is then in the criteria for selecting among possible refinements. 

Related work In an heuristic optimization context the two interesting, 
but apparently little known, papers [Kam and Kopec 1996, Popat et al. 



2001 , discuss a technique for decoding images based on high-order lan- 
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guage models for which upper-bounds are constructed in terms of simpler 
variable-order models. Our application of OS* in section [SJ] to the problem 



of maximizing a high-order HMM is similar to their technique; however Kam 



and Kopec 1996, Popat et al. , 2001 do not attempt to generalize their ap- 
proach to other optimization problems amenable to dynamic programming 
or discuss any connection to sampling. 

In order to improve the acceptance rate of rejection sampling, one has to 
lower the proposal q curve as much as possible while keeping it above the p 



curve. In order to do that, some authors Gilks and Wild, 1992, Gorur and 



Teh, 2008 , have proposed Adaptive Rejection Sampling (ARS) where, based 
on rejections, the q curve is updated to a lower curve q' with a better accep- 
tance rate. These techniques have predominantly been applied to continuous 
distributions on the one-dimensional real line where convexity assumptions 
on the target distribution can be exploited to progressively approximate 
it tighter and tighter through upper bounds consisting of piecewise linear 
envelopes. 

Also in the context of rejection sampling, [Mansinghka et aL 2009| con- 
siders the case of a probabilistic graphical model; it introduces an heuristi- 
cally determined order of the variables in this model and uses this (fixed) 
order to define a sequence of exact samplers over an increasing set of vari- 
ables, where the exact sampler over the first k + 1 variables is recursively 
obtained by using the preceding exact sampler over the first k variables and 
accepting or rejecting its samples based on the {k + 1)*^ variable. While our 
experiments on graphical models in section 3.2 have some similarities to this 
approach, they do not use a cascade of exact samplers, but rather they par- 
tition the space of configurations dynamically based on rejects experienced 
by the current proposal. 



2 The OS* algorithm 

OS* is a unified algorithm for optimization and sampling. For simplicity, 
we first present its sampling version, then move to its optimization version, 
and finally get to the unified view. We start with some background and 
notations about rejection sampling. 

2.1 Background 

Let p : X — 7- be a measurable Li function with respect to a base measure 
/i on a space X, i.e. p{x)dfi{x) < oo. We define p{x) = -r — 7§^-t^- The 
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function p can be seen as an unnormalized density over X, and p as a 
normalized density which defines a probability distribution over X, called 
the target distribution, from which we want to sample fron|^ While we 
may not be able to sample directly from the target distribution p, let us 
assume that we can easily compute p{x) for any given x. Rejection Sampling 



(RS) [Robert and Casella 2004 then works as follows. We define a certain 
unnormalized proposal density q over X, which is such that (i) we know how 
to directly sample from it (more precisely, from q), and (ii) q dominates p, 
i.e. for all x G X,p(x) < q{x). We then repeat the following process: (1) we 
draw a sample x from g, (2) we compute the ratio r{x) = p[x)/q{x) < 1, (3) 
with probability r{x) we accept x, otherwise we reject it, (4) we repeat the 
process with a new x. It can then be shown that this procedure produces 
an exact sample from p. Furthermore, the average rate at which it produces 



these samples, the acceptance rate, is equal to P{X)/Q{X) [Robert and 



Casella, 2004], where for a (measurable) subset A of X, we define P{A) = 
J^p{x)d^{x) and similarly with Q. In Fig. [l| panel (SI), the acceptance 
rate is equal to the ratio of the area below the p curve with that below the 
q curve. 

2.2 Sampling with OS* 

The way OS* does sampling is illustrated on the top of Fig. [T| In this 
illustration, we start sampling with an initial proposal density q (see (SI)). 
Our first attempt produces xi, for which the ratio rg{xi) = p{xi) / q{xi) is 
close to 1; this leads, say, to the acceptance of xi. Our second attempt 
produces X2, for which the ratio rq{x2) = p{x2)/q{x2) is much lower than 1, 
leading, say, to a rejection. Although we have a rejection, we have gained 
some useful information, namely that p{x2) is much lower than q{x2), and 
we are going to use that "evidence" to define a new proposal q' (see (S2)), 
which has the following properties: 

• One has p{x) < q'{x) < q{x) everywhere on X. 

• One has q'{x2) < q{x2). 

One extreme way of obtaining such a q' is to take: 

, {p{x) \ix = X2 

I q[x) it X ^ X2 



^By abuse of language, we will also say that a sample from p is a sample from p. 
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Figure 1: Sampling with OS* (SI, S2), and optimization with OS* (01, 02). 



which, when the space X is discrete, has the effect of improving the accep- 
tance rate, but only slightly so, by insuring that any time happens to 
select X2, it will accept it. 

A better generic way to find a is the following. Suppose that we 
are provided with a small finite set of "one-step refinement actions" a^, 
depending on q and X2, which are able to move from g to a new q'^ = aj{q, X2) 
such that for any such aj one has p{x) < q'j{x) < q{x) everywhere on X and 
also qj{x2) < q{x2)- Then we will select among these aj moves the one 
that is such that the Li norm of q'j is minimal among the possible j's, or 
in other words, such that J^q'-{x)d^{x) is minimal in j. The idea there is 
that, by doing so, we will improve the acceptance rate of q'^ (which depends 
directly on as much as possible, while (i) not having to explore a too 

large space of possible refinements, and (ii) moving from a representation 
for q to an only slightly more complex representation for q'- , rather than to a 
much more complex representation for a q' that could result from exploring 
a larger space of possible refinements for gj^ The intuition behind such one- 
step refinement actions aj will become clearer when we consider concrete 
examples later in this paper. 

^In particular, even if we could find a refinement q' that would exactly coincide with p, 
and therefore would have the smallest possible Li norm, we might not want to use such a 
refinement if this involved an overly complex representation for q' . 
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2.3 Optimization with OS* 

The optimization version of OS* is illustrated on the bottom of Fig. [l| where 
(01) shows on the one hand the function p that we are trying to maximize 
from, along with its (unknown) maximum p* , indicated by a black circle 
on the p curve, and corresponding to x* in X. It also shows a "proposal" 
function q which is such — analogously to the sampling case — that (1) 
the function q is above p on the whole of the space X and (2) it is easy to 
directly find the point xi in X at which it reaches its maximum q* , shown 
as a black circle on the q curve. 

A simple, but central, observation is the following one. Suppose that 
the distance between q{xi) and p{xi) is smaller than e, then the distance 
between q{xi) and p* is also smaller than e. This can be checked immediately 
on the figure, and is due to the fact that on the one hand p* is higher than 
p{xi), and that on the other hand it is below q{x*), and a fortiori below 
q{xi). In other words, if the maximum that we have found for g is at a 
coordinate xi and we observe that q{xi) — p{xi) < e, then we can conclude 
that we have found the maximum of p up to e. 

In the case of xi in the figure, we are still far from the maximum, and 
so we "reject" xi, and refine q into q' (see (02)), using exactly the same 
approach as in the sampling case, but for one difference: the one-step refine- 
ment option Oj that is selected is now chosen on the basis of how much it 
decreases, not the Li norm of q, but the max of g — where, as a reminder, 
this max can also be notated H^Hoo, using the Loo norm notationj^ 

Once this q' has been selected, one can then find its maximum at X2 and 
then the process can be repeated with qi = q,q2 = q' , ■■■ until the difference 
between qk{xk) and p{xk) is smaller than a certain predefined threshold. 



2.4 Sampling Li vs. Optimization 

While sampling and optimization are usually seen as two completely distinct 
tasks, they can actually be viewed as two extremities of a continuous range. 



when considered in the context of Lp spaces lAsh and Doleans-Dade 



1999 



If {X, /i) is a measure space, and if / is a real- valued function on this 
space, one defines the Lp norm for 1 < p < oo as: 

. i/P 

{x)dn{x) 



[ \f\' 
Jx 



formal definition of that norm is that ||g||oo is equal to the "essential supremum" 
of q over (X, /i) (see below), but for all practical purposes here, it is enough to think of 
this essential supremum as being the max, when it exists. 
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One also defines the the Lqo 

norm ||/||oo '^s: 
, = inf{C > : < C for almost every x}, 



where the right term is called the essential supremum of |/|, and can be 
thought of roughly as the "max" of the function. So, with some abuse 
of language, we can simply write: ||/||oo = max^gx |/|. The space Lp, for 
1 < p < oo, is then defined as being the space of all functions / for which 
< oo. 



IP 

Under the simple condition that ||/||p < oo for some p < oo, we have: 
lim^ 



The standard notion of sampling is relative to Li. However we can intro- 
duce the following generalization — where we use the notation Lq instead 
of Lp in order to avoid confusion with our use of p for denoting the target 
distribution. We will say that we are performing sampling of a non-negative 
function f relative to La{X,fj,), for 1 < a < oo, if / G Lq,(X, /x) and if 
we sample — in the standard sense — according to the normalized density 
distribution f{x) = j /(iy'dfiix) ' ~ ^® ^'"^ Xh^sX we 

are sampling relative to Loo {X-, fJ.), f £ Loo (X, //) and if we are performing 
optimization relative to /, more precisely, if for any e > 0, we are able to 
find an x such that |||/||oo — fix)\ < ^■ 

Informally, sampling relative to La "tends" to sampling with L^o (i.e. 
optimization), for a tending to oo, in the sense that for a large a, an x 
sampled relative to Lq "tends" to be close to a maximum for /. We will not 
attempt to give a precise formal meaning to that observation here, but just 
note the connection with the idea of simulated annealing [Kirkpatrick et al. 



19831, which we can view as a mix between the MCMC Metropolis-Hastings 



sampling technique Robert and Casella, 2004 and the idea of sampling in 



La spaces with larger and larger a's. 

In summary, we thus can view optimization as an extreme form of sam- 
pling. In the sequel we will often use this generalized sense of sampling in 
our algorithms]^ 



2.5 OS* as a unified algorithm 

The general design of OS* can be described as follows: 

*Note: While our two experiments in section [s] are based on discrete spaces, the OS* 
algorithm is more general, and can be applied to any measurable space (in particular 
continuous spaces); in such cases, p and q have to be measurable functions, and the 
relation p < q should be read as p{x) < q{x) a.e. (almost everywhere) relative to the base 
measure jj,. 
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• Our goal is to OS-sample from p, where we take the expression "OS- 
sample" to refer to a generalized sense that covers both sampling (in 
the standard sense) and optimization. 

• We have at our disposal a family Q of proposal densities over the space 
(X, /i), such that, for every Q,we are able to OS-sample efficiently 
from q. 

• Given a reject xi relative to a proposal g, with p{xi) < q{xi), we have 
at our disposal a (limited) number of possible "one-step" refinement 
options q', with p < q' < q, and such that q'{xi) < q{xi). 

• We then select one such q'. One possible selection criterion is to prefer 
the q' which has the smallest Li norm (sampling case) or Loo norm 
(optimization). In one sense, this is the most natural criterion, as it 
means we are directly lowering the norm that controls the efficiency 
of the OS-sampling; for instance, for sampling, if q[ and ^2 two 
candidates refinements with \\q[\\i < ||q'2l|ij then the acceptance rate 
of q'l is larger than that of q'2, simply because then P(X)/Q'i{X) > 
P{X)/Q'2{X) , and similarly, in optimization, if ||(?x||oo < lk2lloo) then 
the gap between m.axx{q'i{x)) and p* is smaller than that between 
maxx{q2{x)) andp*, simply because then maxa;(g'^(x)) < maxa;(g2(a;))- 
However, using this criterion may require the computation of the norm 
of each of the possible one-step refinements, which can be costly, and 
one can prefer simpler criteria, for instance simply selecting the q' that 
minimizes q'{xi). 

• We iterate until we settle on a "good" q: either (in sampling) one 
which is such that the cumulative acceptance rate until this point is 
above a certain threshold; or (in optimization) one for which the ratio 
p{xi) / q{xi) is closer to 1 than a certain threshold, with xi being the 
maximum for q. 

The following algorithm gives a unified view of OS*, valid for both sam- 
pling and optimization. This is a high-level view, with some of the content 
delegated to the subroutines OS-Sample , Accept-or-Reject , Update, Refine, 
Stop, which are described in the text. 
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Algorithm 1 The OS* algorithm 



1: while not Stop(/i) do 
2: OS-Sample x ~ g 

3: r p{x)/q{x) 

4: Accept-or-Reject(x, r) 

5: Update(/i, x) 

6: if Rejected(2;) then 

7: g <— Refine(g, j;) 

8: return q along with accepted x's in h 



On entry into the algorithm, we assume that we are either in sample 
mode or in optimization mode, and also that we are starting from a proposal 
q which (1) dominates p and (2) from which we can sample or optimize 
directly. We use the terminology OS-Sample to represent either of these 
cases, where OS-Sample x ~ g refers to sampling an x according to the 
proposal q or optimizing x on g (namely finding an x which is an argmax of 
g), depending on the case. On line (1), h refers to the history of the sampling 
so far, namely to the set of trials xi,X2, ■■■ that have been done so far, each 
being marked for acceptance or rejection (in the case of sampling, this is the 
usual notion, in the case of optimization, all but the last proposal will be 
marked as rejects). The stopping criterion Stop(/i) will be to stop: (i) in 
sampling mode, if the number of acceptances so far relative to the number 
of trials is larger than a certain predefined threshold, and in this case will 
return on line (8), first, the list of accepted x's so far, which is already a 
valid sample from p, and second, the last refinement g, which can then be 
used to produce any number of future samples as desired with an acceptance 
ratio similar to the one observed so far; (ii) in optimization mode, if the last 
element x of the history is an accept, and in this case will return on line (8), 
first the value x, which in this mode is the only accepted trial in the history, 
and second, the last refinement q (which can be used for such purposes as 
providing a "certificate of optimality of x relative to p" , but we do not detail 
this here). 

On line (3), we compute the ratio r, and then on line (4) we decide to 
accept x or not based on this ratio; in optimization mode, we accept x if 
the ratio is close enough to 1, as determined by a thresholc^ in sampling 
mode, we accept x based on a Bernoulli trial of probability r. 

®When X is a finite domain, it makes sense to stop on a ratio equal to 1, in which 
case we have found an exact maximum. This is what we do in some of our experiments 
in section [3| 
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Figure 2: A connection with A*. 



On line (5), we update the history by recording the trial x and whether 
it was accepted or not. 

If X was rejected (line (6)), then on line (7), we perform a refinement of 
g, based on the principles that we have explained. 

2.6 A connection with A* 

A special case of the OS* algorithm, which we call "OS* with piecewise 
bounds", shows a deep connection with the classical A* optimization algo- 
rithm [Hart et ah 1968 and is interesting in its own right. Let us first focus 



on sampling, and let us suppose that qQ represents an initial proposal den- 
sity, which upper-bounds the target density p over X. We start by sampling 
with go 5 and on a first reject somewhere in X, we split the set X into two 
disjoint subsets Xi,X2-, obtaining a partition of X. By using the more pre- 
cise context provided by this partition, we may be able to improve our upper 
bound qQ over the whole of X into tighter upper bounds on each of Xi and 
X2, resulting then in a new upper bound qi over the whole of X. We then 
sample using gi, and experience at some later time another reject, say on a 
point in Xi] at this point we again partition Xi into two subsets Xn and 
X12, tighten the bounds on these subsets, and obtain a refined proposal q2 
over X] we then iterate the process of building this "hierarchical partition" 
until we reach a certain acceptance-rate threshold. 

If we now move our focus to optimization, we see that the refinements 
that we have just proposed present an analogy to the technique used by A*. 
This is illustrated in Fig. [2} In A*, we start with a constant optimistic bound 
— corresponding to our go — for the objective function which is computed 
at the root of the search tree, which we can assume to be binary. We then 
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expand the two daughters of the root, re-evaluate the optimistic bounds 
there to new constants, obtaining the piecewise constant proposal qi, and 
move to the daughter with the highest bound. We continue by expanding 
at each step the leaf of the partial search tree with the highest optimistic 
bound (e.g. moving from qi to q2, etc.)j^ 

We will illustrate OS* sampling with (non-constant) piece-wise bounds 



in the experiments of section 3.2 



3 Experiments 
3.1 HMMs 

Note: An extended and more detailed version of these experiments is pro- 



vided in Carter et al., 20121 



The objective in our HMM experiments is to sample a word sequence 
with density p[x) proportional to p{x) = p\ya{x) Pohs{o\x)^ where p\^ is the 
probability of the sequence x under an n-gram model and Pohs{o\x) is the 
probability of observing the noisy sequence of observations o given that the 
word sequence is x. Assuming that the observations depend only on the 
current state, this probability can be written: 



p{x) 



4 = 1 



{Xi 



-n+l) 



Pohs{0i\Xi) 



Approach Taking a tri-gram language model for simplicity, let us de- 
fine w^{xi\xi-2Xi^i) = p\ja{xi\xi-2Xi^i) Pohs{oi\xi). Then consider the ob- 
servation o be fixed, and write p{x) = 'W^u]^{xi\xi-2Xi-i). In optimiza- 
tion/decoding, we want to find the argmax of p{x), and in sampling, to 
sample from p{x). Note that the state space associated with p can be huge, 

^OS*, when used in optimization mode, is in fact strictly more general than A* , for two 
reasons: (i) it does not assume a piecewise refinement strategy, namely that the refinements 
follow a hierarchical partition of the space, where a given refinement is limited to a leaf of 
the current partition, and (ii) even if such a stategy is followed, it does not assume that 
the piecewise upper-bounds are constant. Both points will become clearer in the HMM 
experiments of section |3.1[ where including an higher-order n-gram in q has impact on 
several regions of X simultaneously, possibly overlapping in complex ways with regions 
touched by previous refinements; in addition, the impact of a single n-gram is non-constant 
even in the regions it touches, because it depends of the multiplicity of the n-gram, not 
only on its presence or absence. 
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Figure 3: An example of an initial q-automaton (a), and its refinement (h). 

as we need to represent explicitly all contexts {xi^2-,Xi^i) in the case of a 
trigram model, and even more contexts for higher-order models. 

We define W2{xi\xi-i) = max^^^ ^ ""^sC^^il^j-za^i-i), along with wi{xi) = 
max2:._j W2{xi\xi-i)^ where the maxima are taken over all possible context 
words in the vocabulary. These quantities, which can be precomputed 
efficiently, can be seen as optimistic "max-backoffs" of the trigram x\_2, 
where we have forgotten some part of the context. Our initial proposal 
is then qo{x) = W,iWi{xi). Clearly, for any sequence x of words, we have 
p{x) < qo{x)- The state space of qo is much less costly to represent than 
that of p{x). 

The proposals qt, which incorporate n-grams of variable orders, can be 
represented efficiently through WFSAs (weighted FSAs). In Fig. ^a), we 
show a WFSA representing the initial proposal qo corresponding to an ex- 
ample with four observations, which we take to be the acoustic realizations 
of the words 'the, two, dogs, barked'. The weights on edges correspond 
only to unigram max-backoffs, and thus each state corresponds to a NULL- 
context. Over this WFSA, both optimization and sampling can be done 



efficiently by the standard dynamic programming techniques (Viterbi iRa- 



biner 1989 and "backward filtering- forward sampling" Scott, 2002] ), where 



the forward weights associated to states are computed similarly, either in 
the max-product or in the sum-product semiring. 

Consider first sampling, and suppose that the first sample from qQ pro- 
duces xi = the two dog barked, marked with bold edges in the drawing. Now, 
computing the ratio p{xi)/qo{xi) gives a result much smaller than 1, in part 
because from the viewpoint of the full model p, the trigram the two dog is 
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very unlikely; i.e. the ratio W2,{dog\the two) /wi{dog) is very low. Thus, with 
high probability, xi is rejected. When this is the case, we produce a refined 
proposal qi, represented by the WFSA in Fig.jsj^b), which takes into account 
the more realistic bigram weight W2{dog\two) by adding a node (node 6) for 
the context two. We then perform a sampling trial with gi, which this time 
tends to avoid producing dog in the context of two; if we experience a reject 
later on some sequence X2, we refine again, meaning that we identify an 
n-gram in X2, which, if we extend its context by one (e.g from a unigram to 
a bigram or from a bigram to a trigram) , accounts for some significant part 
of the gap between qi{x2) and p{x2)- We stop the refinement process when 
we start observing acceptance rates above a certain fixed threshold. 

The case of optimization is similar. Suppose that with qo the maximum 
is xi = the two dog barked, then we observe that p{xi) is lower than qQ^xi), 
reject xi and refine qo into qi. We stop the process at the point where the 
value of qt, at its maximum Xq^, is equal to the value of p at Xq^, which 
implies that we have found the maximum for p. 



Setup We evaluate our approach on an SMS-message retrieval task. Let 
N be the number of possible words in the vocabulary. A latent variable 
X G {!,•■■ i^Y represents a sentence defined as a sequence of £ words. 
Each word is converted into a sequence of numbers based on a mobile phone 
numeric keypad, assuming some level of random noise in the conversion. 
The task is then to recover the original message. 

We use the English side of the Europarl corpus [Koehn , 2005] for training 
and test data (1.3 million sentences). A 5-gram language model is trained 
using SRILM [Stolcke 2002| on 90% of the sentences. On the remaining 
10%, we randomly select 100 sequences for lengths 1 to 10 to obtain 1000 
sequences from which we remove the ones containing numbers, obtaining a 
test set of size 926. 



Optimization We limit the average number of latent tokens in our decod- 
ing experiments to 1000. In the top plot (a) of Fig. |4] we show the number 
of iterations (running Viterbi then updating q) that the different n-gram 
models of size 3, 4 and 5 take to do exact decoding of the test-set. For a 
fixed sentence length, we can see that decoding with larger n-gram models 
leads to a sub-linear increase w.r.t. n in the number of iterations taken. 

To demonstrate the reduced nature of our q-automaton, we show in 
Table [T] the distribut ion of n-grams in our final model for a specific input 
sentence of length 10. The total number of n-grams in the full model would 
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Figure 4: SMS-retrieval experiment, (a): optimization; (b) and (c): sam- 
pling. 



be ~3.0x 10^^; exact decoding here is not tractable using existing techniques. 
By comparison, our HMM has only 118 five-grams and 9008 n-grams in total. 



n: 


1 


2 


3 


4 


5 


q: 


7868 


615 


231 


176 


118 



Table 1: # of n-grams in our variable- order HMM. 



Sampling For the sampling experiments, we limit the number of latent 
tokens to 100. We refine our q automaton until we reach a certain fixed 
cumulative acceptance rate (AR). We also compute a rate based only on the 
last 100 trials (AR-100), as this tends to better refiect the current acceptance 
rate. In plot (b), at the bottom of Fig. |4| we show a single sampling run 
using a 5- gram model for an example input, and the cumulative # of accepts 
(middle curve) . It took 500 iterations before the first successful sample from 
p. 

We noted that there is a trade-off between the time needed to compute 
the forward probability weights needed for sampling, and the time it takes 
to adapt the variable-order HMM. To resolve this, we use batch- updates: 
making B trials from the same (^-automaton, and then updating our model 
in one step. By doing this, we noted significant speed-ups in sampling times. 
Empirically, we found B = 100 to be a good value. In plot (c), we show 
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the average # of iterations in our models once refinements are finislied (AR- 
100=20%) for different orders n over different lengtlis. We note a sub-linear 
increase in the number of trials when moving to higher n; for length=10, 
and for n = 3,4,5, average number of trials: 3-1105, 4-1238, 5-1274. 



3.2 Discrete Probabilistic Graphical Models 

Approach The OS* approach can be applied to exact sampling and opti- 
mization on graphical models with loops, where the objective function takes 
the form of a product of local potentials: 

pix)= llMx)llM^), (2) 

where (A^, £) defines an undirected graph with nodes M and edges S. The 
unary potential functions are denoted by G J\f and the binary poten- 

tials by (/)e,e S £■ Since integrating and sampling from Q can be done 
efficiently for trees, we first determine a spanning tree T of the graph £. 
Let us denote by (j)^'^^ and i;^™™ the maximal and minimal values of the 
potential for edge e. If we define: 

Qi^) = n ^"(^) n n ^^^^^ 

neAf eer ee£~T 

then q is an upper-bound for p over 

To improve this upper bound, we use the conditioning idea (see e.g. 
[Roller and Friedman 2009 , chap. 9.5) which corresponds to partition- 



ing the configuration space X. Assume we observe all the possible values 
1,2,-- - ,K of a node, say Xi, having the set of incident edges Cj. The 
restriction of p to the subspace Xi^k = {x € X; Xi = k} can be written as: 

Pi,k(.^) = ^ii^) n n ^""^^^ n ^^^^^^ > 

n&Af-{i} eeCi ee£-Ci 

^ ' V ' 

unary potentials binary potentials 

where the binary potentials of edges incident to Xi have now been absorbed 
into unary potentials associated with neighboring nodes to Xj, and where 



^Any choice of tree produces an upper-bound, but it is advantageous to choose one 
for which q is as "close" as possible to p, which we heuristically do by using Prim's 
algorithm [Prim 1957 for selecting a maximum spanning tree on the graph having weights 



log((^™''''/<?!>™'"), e £ S. The intuition is that edges with nearly constant potentials create 
a small gap between the exact value (j)e (x) and the the bound (j>f^^ and can be left outside 
the tree. 
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we have used the informal notation ^^Ji{k) in an obvious way. Note that, by 
doing so, we have ehminated from the graph the edges incident to Xi, which 
may have been participating in loops; in particular, if in equation ([s]) we 
had some edge e £ £ — T incident to Xi, then this edge is now absorbed 
in one of the neighbors of Xi, which implies that the maximum (p^^^ has 
now been replaced by its exact value, which is necessarily lower, implicitly 
refining q. By conditioning with respect to all the possible values of Xi, 
we obtain K different graphs with the node Xi removed, in other words we 
have partitioned the event space into K subspaces; we then define on each 
such subspace a proposal qi^k{x) as in equation ([s]), which is lower than the 
restriction of q to fc. We then define the refinement q' of q on the whole 
of X as q'{x) = Ylik1i,ki^)-> where qi^k{x) is taken to be null for x Xi^k- 
This scheme can be seen to be an instance of OS* with piecewise bounds. 

If we repeat this process iteratively based on observed rejections in one 
of the subspaces, we obtain a hierarchical partition of X which is more 
fine-grained in some regions of X than in others. The refinements obtained 
have the form ([3|, but on reduced graphs in which we have introduced 
the evidences given by the conditioned variables. While the cardinality 
of the hierarchical partition may grow exponentially, we can monitor the 
acceptance rate/computation time ratio, as we will see below. 



Ising model experiment In what follows, we only consider exact sam- 
pling, the problem of MAP estimation (i.e. optimization) in graphical mod- 



els having been thoroughly investigated over the past decade (e.g. Sontag 



et al. , 2008| ). One interesting question is to understand the trade-off be- 



tween improving the acceptance rate and incurring the cost of performing a 
refinement. An a priori possible policy could be to first refine the proposals 
up to a certain point and only then sample until we get the required number 
of samples. As the experiments will show, there are however two reasons to 
interleave sampling with refining: the first is that observing rejected samples 
from q helps to choose the refinements, as argued previously; the second is 
to have a criterion for stopping the refinement process: the computation of 
this criterion requires an estimate of the current acceptance rate which can 
be estimated from the samples. 

We consider a Ising model of 100 binary variables on a 10x10 uniform 
grid with unary potentials and binary coupling strengths drawn according 
to a centered normal distribution with standard deviation 0.1. Note that, 
in certain cases, exact sampling for Ising models can be done in polynomial 
time |Ullrich[ [2010] by using an elegant MCMC approach called coupling 
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Acceptance probability vs. number of refinements 



Expected compulation time vs. acceptance ratio 



- random split in the region of the rejected sample 

- highest bound improvement on rejected sample 
" most probable region 

highest acceptance rate 




number of refinements 



acceptance probability 



Figure 5: Comparison of different refinement policies for the piecewise bound 
on a 10x10 Ising model grid. 



from the past Propp and Wilson 1996| . It is based on the fact that two 



properly coupled Markov chains follow exactly the target distribution at 
the time they coaelesce. However, applications of this approach rely on 
certain strong assumptions on the potentials, typically that they are either 
all attractive or all repulsive [Craiu and Meng 2011 , while we sample models 
with random positive or negative coupling strengths, making the problem 
much harder. One interesting extension of our work would be to use these 
algorithms as proposal distributions for more general models. 

The OS* algorithm was run using 4 different policies for the choice of 
the refinements, where a policy is a function that takes as input the cur- 
rent proposal and returns a (subspace, conditioning-variable) pair. The first 
two policies in addition use a rejected sample x as input, while the last two 
do not and are deterministic: (i) random split in the region of the rejected 
sample means that once an observation has been rejected, we refine the 
subspace which contains the sample by conditioning with respect to one of 
the remaining variables Xi selected at random; (ii) highest bound improve- 
ment on rejected sample is the same but the index i of the conditioning 
variable is selected such that refining on Xi leads to the largest decrease in 
the value of q{x) (for this specific rejected sample x); this is a similar re- 
finement strategy to the one we used in the HMM experiments; (iii) most 
probable region refines a variable (selected uniformly at random) in the most 
probable subspace of the piecewise bound q; (iv) highest acceptance rate is 
the most "ambitious" greedy policy where the largest reduction of the total 
mass q{X) of the proposal is identified among all the possible choices of 
(subspace, conditioning- variable) . This has been implemented efficiently by 
maintaining a priority queue containing all the possible triples (subspace, 
conditioning- variable, maximal-bound- improvement). The acceptance rates 



17 



obtained by following these policies are compared on Figure [s] (left) by run- 
ning 2000 refinements with policies (i) and (iii), 800 refinements with policy 
(ii) and 400 refinement with policy (iv). Results confirm that the best refine- 
ment is obtained using the deterministic policy (iv). The policy (ii) based 
on rejected samples reaches the same acceptance rates using twice as many 
refinements. The other two policies (i) and (iii) are very naive and do not 
reach a significant acceptance rate, even after 400 iterations. Based on these 
results, one might conclude that the policy (iv) is the best. However, we 
will now see that this is not the case when the computation time is taken 
into account. 

After T trials, the partition function oip, namely p(X), can be estimated 
(without bias) by Zt = ^ Ylt=i Qt{X) where the observed accept ratios 
{rt}f^i and refinement weights {q{X)t}'[^i are used. Hence, ttt = q^^x) 
is an unbiased estimate of the current acceptance rate ttt = p{X) / qxiX). 
Note that these estimators are based on all the samples rcj, accepted or not. 
Hence, if we decide to stop refining now, the expected time to obtain n addi- 

samp 

tional exact samples from the target distribution p is approximately — ^ — , 
where t^'^^ is the average time to obtain one trial from the distribution qt- 
If we add the total time r|?^ spent in computing the set of refinements up to 
the current refinement qT, we obtain an estimate f^"* of the expected total 

samp 

time to obtain n samples: f,^"* = h . This quantity computed for 

n = 1 sample is plotted against the acceptance rate on Figure [s] (right). For 
each policy, the expected computation time starts to decrease as the accep- 
tance rate increases; in this regime, the refinement time is small compared 
to the time reduction due to a higher acceptance rate. For large values of 
the acceptance rate, the refinement time is no more negligeable, leading to 
an increase of the total computation time. We see (Figure [5] (right)) that 
the highest acceptance rate policy (iv) , despite its very good acceptance rate 
for a fixed number of refinements, requires more time in total than alter- 
native refinements policies. The difference is striking: if we look at the 
minimum of each curve (which corresponds to the optimal stopping time for 
the refinements), it is at least 10 times faster to use the policy (ii) based 
on a refinement rule applied only to the rejected sample. This experiment 
confirms the benefit of using rejected samples: by adaptively choosing the 
refinements, we spot the regions of the space that are important to refine 
much faster than by computing the best possible alternative refinement. 
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4 Conclusions and Perspectives 



In this paper, we have proposed a unified viewpoint for rejection sam- 
phng and heuristic optimization, by using functional upper-bounds for both. 
While in sampling, the upper-bounds are refined by decreasing their integral 
(Li norm), in optimization they are refined by decreasing their maximum 
(Loo norm). Depending on the problem, several classes of upper bounds 
can be used. We showed that variable-order max-backoff bounds on n-gram 
probabilities gave state-of-the art performances on the exact decoding of 
high-order HMMs. For many practical problems, simpler piecewise bounds 
can be derived, which we illustrated on the example of sampling and decod- 
ing on a large tree-width graphical model. One interesting property of the 
proposed approach is the adaptive nature of the algorithm: the rejected sam- 
ple is used to quickly choose an effective refinement, an approach which can 
be computational attractive compared to the computation of the Lp norm 
of all the potential one-step refinements. In the case of graphical model 
sampling, we showed that this can lead to a speedup factor of an order of 
magnitude. 

The results presented in this paper motivate further research in the de- 
velopment of domain-specific functional bounds. One important extension 
will be to derive bounds for agreement-based models p{.) = pi(.)p2(-) cor- 
responding to the product of two (or more) simple models pi and p2- One 
typical example corresponds to the agreement between an HMM pi and a 
probabilistic context-free grammar p2 in order to take into account both the 



syntactic and the low-level n-gram structures of the language |Rush et al. 
(2010 



Another research direction is to improve those models that are based 
on piecewise bounds: their quality can be limited since their refinement is 
always local, i.e. they can only improve the bound on a single element of the 
partition. In the example of graphical model sampling, if we condition on the 
value of a first variable, say xi, and then further refine the partition {xi = k} 
by conditioning on the value of a second variable X2, the complementary 
region of the space {xi 7^ k} will not be impacted. Intuitively, the quality 
of the bound could be improved if we could refine it jointly for xi and X2, 
in a way analogous to what was done in the context of HMMs, where the 
incorporation of one higher-order n-gram into the proposal had a global 
impact on the whole event space. 
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